This lecture is about complex data structures, or non-rectangular data.
father  mother  name      age  gender
-       -       John      33   male
-       -       Julia     32   female
John    Julia   Jack      6    male
John    Julia   Jill      4    female
John    Julia   John jnr  2    male
-       -       David     45   male
-       -       Debbie    42   female
David   Debbie  Donald    16   male
David   Debbie  Dianne    12   female
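The table above forces two families into rows and columns, which leaves the parent fields empty for the parents themselves. A minimal sketch of how the same information could instead be held as a nested list in R; the object and field names (`families`, `parents`, `children`) are illustrative assumptions and not part of the lecture's own code.

```r
# One list entry per family; each family nests its parents and its children.
families <- list(
  list(
    parents = list(
      father = list(name = "John", age = 33, gender = "male"),
      mother = list(name = "Julia", age = 32, gender = "female")
    ),
    children = list(
      list(name = "Jack", age = 6, gender = "male"),
      list(name = "Jill", age = 4, gender = "female"),
      list(name = "John jnr", age = 2, gender = "male")
    )
  ),
  list(
    parents = list(
      father = list(name = "David", age = 45, gender = "male"),
      mother = list(name = "Debbie", age = 42, gender = "female")
    ),
    children = list(
      list(name = "Donald", age = 16, gender = "male"),
      list(name = "Dianne", age = 12, gender = "female")
    )
  )
)

# Number of children per family
n_children <- sapply(families, function(f) length(f$children))

# Names of the children in the first family
first_family_children <- sapply(families[[1]]$children, function(ch) ch$name)
```

In this form no cell has to be left empty, because the parent-child relation is expressed by nesting rather than by repeating the parents' names in every row.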
Recall the air quality data you have worked with as part of the exercises. We can store this data in the form of a CSV file as illustrated below. Commas separate columns, new lines separate rows, and the first row contains the column/variable names.
unique_id,indicator_id,name,measure,measure_info,geo_type_name,geo_join_id,geo_place_name,time_period,start_date,data_value
216498,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2013,2013-06-01T00:00:00,34.64
216499,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2014,2014-06-01T00:00:00,33.22
219969,386,Ozone (O3),Mean,ppb,Borough,1,Bronx,Summer 2013,2013-06-01T00:00:00,31.25
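As a quick check of how this format maps onto a rectangular object, the three records can be parsed with base R. This is a minimal sketch: the CSV text is pasted inline (with each record on one line) rather than read from the course's data file, and the object name `air` is an assumption.

```r
# The CSV sample from above, one record per line
csv_string <- "unique_id,indicator_id,name,measure,measure_info,geo_type_name,geo_join_id,geo_place_name,time_period,start_date,data_value
216498,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2013,2013-06-01T00:00:00,34.64
216499,386,Ozone (O3),Mean,ppb,CD,313,Coney Island (CD13),Summer 2014,2014-06-01T00:00:00,33.22
219969,386,Ozone (O3),Mean,ppb,Borough,1,Bronx,Summer 2013,2013-06-01T00:00:00,31.25"

# Commas separate columns, new lines separate rows, first row holds the names
air <- read.csv(text = csv_string, stringsAsFactors = FALSE)

dim(air)              # rows and columns
names(air)            # variable names taken from the first row
mean(air$data_value)  # numeric columns are detected automatically
```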
The same data can also be stored as an XML file. The first few lines of this file could look like this:
<row>
<row>
<unique_id>216498</unique_id>
<indicator_id>386</indicator_id>
<name>Ozone (O3)</name>
<measure>Mean</measure>
<measure_info>ppb</measure_info>
<geo_type_name>CD</geo_type_name>
<geo_join_id>313</geo_join_id>
<geo_place_name>Coney Island (CD13)</geo_place_name>
<time_period>Summer 2013</time_period>
<start_date>2013-06-01T00:00:00</start_date>
<data_value>34.64</data_value>
</row>
<row>
<unique_id>216499</unique_id>
<indicator_id>386</indicator_id>
...
</row>

- The characters <, >, and / give the data structure.
- Tags hold the variable names and enclose the data values: <variablename>value</variablename>.

For example, the entire air quality dataset content we know from the CSV example above is nested between the 'row' tags:

- row is the "root element" of the XML document.
- row contains several row elements, which in turn contain several tags/variables describing a unique record (such as time_period).

It becomes obvious that XML files follow a tree structure:
There are two principal ways to link variable names and data values in XML:
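1. Tag-based: define an opening and a closing tag named after the variable and place the value between them: <variablename>value</variablename>. In the example below: <filename>ISCCPMonthly_avg.nc</filename>.
2. Attribute-based: store the values as attributes within a single tag: <observation variablename="value">. In the example below: <case date="16-JAN-1994" temperature="9.200012" />.

<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>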
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />

The same information can be stored either way, as the following example shows:
Attributes-based:
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />

Tag-based:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>

Note the key differences of storing data in XML format in contrast to a flat, table-like format such as CSV:
Potential drawback of XML: inefficient storage:
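The storage overhead can be made concrete by writing the four temperature records from the example above once as CSV and once as tag-based XML, and counting characters. A rough sketch in base R; the variable names are illustrative, and the exact ratio depends on tag names and formatting.

```r
dates <- c("16-JAN-1994", "16-FEB-1994", "16-MAR-1994", "16-APR-1994")
temps <- c("9.200012", "10.70001", "7.5", "8.100006")

# CSV: one header line, then one comma-separated line per record
csv_string <- paste(c("date,temperature", paste(dates, temps, sep = ",")),
                    collapse = "\n")

# Tag-based XML: every value is wrapped in an opening and a closing tag
xml_string <- paste0(
  "<cases>",
  paste0("<case><date>", dates, "</date><temperature>", temps,
         "</temperature></case>", collapse = ""),
  "</cases>"
)

size_csv <- nchar(csv_string)
size_xml <- nchar(xml_string)

# How many times larger the XML representation is
size_xml / size_csv
```

The variable names appear once in the CSV header but are repeated twice per value in the XML version, which is where the extra characters come from.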
The following two data samples show the same information once stored in an XML file and once in a JSON file:
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>

JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}

Data structured according to either XML or JSON syntax can be thought of as following a tree-like structure:
The following examples are based on the example code shown above (the
two text-files persons.json and
persons.xml)
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("persons.xml")
xml_doc
## {xml_document}
## <person>
## [1] <firstName>John</firstName>
## [2] <lastName>Smith</lastName>
## [3] <age>25</age>
## [4] <address>\n <streetAddress>21 2nd Street</streetAddress>\n <city>New York</city>\n <state> ...
## [5] <phoneNumber>\n <type>home</type>\n <number>212 555-1234</number>\n</phoneNumber>
## [6] <phoneNumber>\n <type>fax</type>\n <number>646 555-4567</number>\n</phoneNumber>
## [7] <gender>\n <type>male</type>\n</gender>
'person' is the root node; 'firstName', 'lastName', 'age', 'address', etc. are its children:
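The call that produced the node set printed next is not shown in these notes. A sketch of what it likely looks like, using xml2's `xml_children()`; to stay self-contained it parses an abbreviated inline copy of the person document instead of `persons.xml` (an assumption made for illustration).

```r
library(xml2)

# Abbreviated inline version of the person document
xml_doc <- read_xml(
  "<person><firstName>John</firstName><lastName>Smith</lastName><age>25</age></person>"
)

# Name of the root node
root_name <- xml_name(xml_doc)

# Navigate one level down: all children of the root node
children <- xml_children(xml_doc)
child_names <- xml_name(children)
children
```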
## {xml_nodeset (7)}
## [1] <firstName>John</firstName>
## [2] <lastName>Smith</lastName>
## [3] <age>25</age>
## [4] <address>\n <streetAddress>21 2nd Street</streetAddress>\n <city>New York</city>\n <state> ...
## [5] <phoneNumber>\n <type>home</type>\n <number>212 555-1234</number>\n</phoneNumber>
## [6] <phoneNumber>\n <type>fax</type>\n <number>646 555-4567</number>\n</phoneNumber>
## [7] <gender>\n <type>male</type>\n</gender>
You can thus navigate downwards from the root-node to specific
leaf-nodes via xml_children(). In addition you can navigate
horizontally or upwards via xml_siblings() and
xml_parents(), respectively.
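The three node sets printed next (one child, its siblings, its parent) are consistent with calls along the following lines. This is a reconstruction under assumptions, not the original chunk: it again uses an abbreviated inline document, and `first_child` is a hypothetical name.

```r
library(xml2)

xml_doc <- read_xml(
  "<person><firstName>John</firstName><lastName>Smith</lastName><age>25</age></person>"
)

# Downwards: pick the first child of the root node
first_child <- xml_children(xml_doc)[1]

# Sideways: all other nodes on the same level
siblings <- xml_siblings(first_child)

# Upwards: the ancestors of the node
parents <- xml_parents(first_child)

sibling_names <- xml_name(siblings)
parent_names <- xml_name(parents)
```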
## {xml_nodeset (1)}
## [1] <firstName>John</firstName>
## {xml_nodeset (6)}
## [1] <lastName>Smith</lastName>
## [2] <age>25</age>
## [3] <address>\n <streetAddress>21 2nd Street</streetAddress>\n <city>New York</city>\n <state> ...
## [4] <phoneNumber>\n <type>home</type>\n <number>212 555-1234</number>\n</phoneNumber>
## [5] <phoneNumber>\n <type>fax</type>\n <number>646 555-4567</number>\n</phoneNumber>
## [6] <gender>\n <type>male</type>\n</gender>
## {xml_nodeset (1)}
## [1] <person>\n <firstName>John</firstName>\n <lastName>Smith</lastName>\n <age>25</age>\n <ad ...
Advanced topic: extract specific parts of the data via
XPath. xml_find_all() allows you to find any data values
with specific characteristics as defined by the
xpath argument.[1]
# find data via XPath (the person document has no <name> tag, so we look for <firstName>)
customer_names <- xml_find_all(xml_doc, xpath = ".//firstName")
# extract the data as text
xml_text(customer_names)
## [1] "John"
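XPath expressions can also filter on the content of neighbouring nodes. The following sketch selects the number of the phone entry whose type is "fax"; the predicate syntax `[type='fax']` is standard XPath 1.0, while the inline document and the object names are assumptions made to keep the example self-contained.

```r
library(xml2)

xml_doc <- read_xml(
  "<person>
     <firstName>John</firstName>
     <phoneNumber><type>home</type><number>212 555-1234</number></phoneNumber>
     <phoneNumber><type>fax</type><number>646 555-4567</number></phoneNumber>
   </person>"
)

# Select <number> inside those <phoneNumber> nodes whose <type> equals 'fax'
fax_node <- xml_find_all(xml_doc, xpath = ".//phoneNumber[type='fax']/number")
fax_number <- xml_text(fax_node)
fax_number
```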
Similar to the case of XML, there are several R-packages providing
functions to import and work with JSON. Here, we work with the
easy-to-use jsonlite-package.
# load packages
library(jsonlite)
# parse the JSON-document shown in the example above
json_doc <- fromJSON("persons.json")
# look at the structure of the document
str(json_doc)
## List of 6
## $ firstName : chr "John"
## $ lastName : chr "Smith"
## $ age : int 25
## $ address :List of 4
## ..$ streetAddress: chr "21 2nd Street"
## ..$ city : chr "New York"
## ..$ state : chr "NY"
## ..$ postalCode : chr "10021"
## $ phoneNumber:'data.frame': 2 obs. of 2 variables:
## ..$ type : chr [1:2] "home" "fax"
## ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
## $ gender :List of 1
## ..$ type: chr "male"
The nesting structure is represented as a nested list:
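The output that follows is what list subsetting with `$` returns for the address and gender entries. A self-contained sketch of such calls, parsing a shortened inline copy of the JSON document with jsonlite (the inline string is an assumption standing in for the file):

```r
library(jsonlite)

json_doc <- fromJSON('{
  "firstName": "John",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "gender": {"type": "male"}
}')

# One level down: the whole address entry (itself a list)
address <- json_doc$address
address

# Two levels down: a single value
gender_type <- json_doc$gender$type
gender_type
```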
## $streetAddress
## [1] "21 2nd Street"
##
## $city
## [1] "New York"
##
## $state
## [1] "NY"
##
## $postalCode
## [1] "10021"
## [1] "male"
HyperText Markup Language (HTML) is designed to be read and rendered by a web browser. Yet, web pages (HTML-documents) also contain tables, raw text, and images and thus they are also a file format to store data.
The following short HTML file constitutes a very simple web page:
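- head and body are nested within the html document.
- Within head, we define the title, etc.
- <html>..</html> is the root element.
- Its children are <head>...</head> and <body>...</body>.
- <head>...</head> and <body>...</body> are siblings of each other.

Figure: HTML (DOM) tree diagram.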
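The nesting described above can be inspected with the same xml2 functions used for XML. A minimal sketch, assuming a very small page with a title and one heading; the page content and the name `html_string` are illustrative and not taken from the lecture's file.

```r
library(xml2)

# A very small web page, written inline
html_string <- "<!DOCTYPE html>
<html>
  <head><title>hello, world</title></head>
  <body><h2>hello, world</h2></body>
</html>"

page <- read_html(html_string)

# Root element and its children
root_name <- xml_name(page)
child_names <- xml_name(xml_children(page))

# The title is defined within head
page_title <- xml_text(xml_find_first(page, ".//title"))
```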
In this example, we look at Wikipedia’s Economy of Switzerland page.
## Warning in readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland"): incomplete final line
## found on 'https://en.wikipedia.org/wiki/Economy_of_Switzerland'
Look at the first few imported lines:
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\">"
## [5] "<title>Economy of Switzerland - Wikipedia</title>"
## [6] "<script>(function(){var className=\"client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available\";var cookie=document.cookie.match(/(?:^|; )enwikimwclientpreferences=([^;]+)/);if(cookie){cookie[1].split('%2C').forEach(function(pref){className=className.replace(new RegExp('(^| )'+pref.replace(/-clientpref-\\w+$|[^\\w-]+/g,'')+'-clientpref-\\\\w+( |$)'),'$1'+pref+'$2');});}document.documentElement.className=className;}());RLCONF={\"wgBreakFrames\":false,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],"
Select specific lines (select specific parts of the data):
## [1] "\t\t\t\t"
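The import code for this step is not shown; the warning above indicates that the page was read line by line with `readLines()`. The sketch below illustrates the same approach on a small local file so that it runs offline. The file content and the names `swiss_econ` and `title_line` are assumptions.

```r
# Write a tiny HTML document to a temporary file
html_lines_in <- c(
  "<!DOCTYPE html>",
  "<html>",
  "<head>",
  "<title>Economy of Switzerland - Wikipedia</title>",
  "</head>",
  "<body>",
  "<p>Some text</p>",
  "</body>",
  "</html>"
)
tmp <- tempfile(fileext = ".html")
writeLines(html_lines_in, tmp)

# Import the document as a character vector, one element per line
swiss_econ <- readLines(tmp)
head(swiss_econ, 3)

# Select a specific line by its content and strip the tags
title_line <- swiss_econ[grepl("<title>", swiss_econ, fixed = TRUE)]
page_title <- gsub("<[^>]+>", "", title_line)
page_title
```

Selecting lines by position or pattern works, but it treats the page as plain text and ignores the tag structure, which is why parsing the document is usually the more robust route.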
# install package if not yet installed
# install.packages("rvest")
# load the package
library(rvest)
# parse the webpage, show the content
swiss_econ_parsed <- read_html("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
swiss_econ_parsed
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="U ...
## [2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr mw-hide-empty-elt n ...
Now we can easily separate the data/text from the HTML code. For
example, we can extract the HTML table containing the data we are
interested in as a data frame.
tab_node <- html_node(swiss_econ_parsed,
xpath = "//*[@id='mw-content-text']/div/table[2]")
tab <- html_table(tab_node)
tab
## # A tibble: 19 × 3
## Year `GDP (billions of CHF)` `US dollar exchange`
## <int> <int> <chr>
## 1 1980 184 1.67 Francs
## 2 1985 244 2.43 Francs
## 3 1990 331 1.38 Francs
## 4 1995 374 1.18 Francs
## 5 2000 422 1.68 Francs
## 6 2005 464 1.24 Francs
## 7 2006 491 1.25 Francs
## 8 2007 521 1.20 Francs
## 9 2008 547 1.08 Francs
## 10 2009 535 1.09 Francs
## 11 2010 546 1.04 Francs
## 12 2011 659 0.89 Francs
## 13 2012 632 0.94 Francs
## 14 2013 635 0.93 Francs
## 15 2014 644 0.92 Francs
## 16 2015 646 0.96 Francs
## 17 2016 659 0.98 Francs
## 18 2017 668 1.01 Francs
## 19 2018 694 1.00 Francs
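In the extracted table, the exchange-rate column is stored as text because each value carries the suffix "Francs". The sketch below reproduces the extraction on a two-row inline table (so it runs without network access) and then converts that column to numbers. The inline HTML and the column name `usd_chf` are assumptions for illustration.

```r
library(rvest)

# A two-row excerpt of the table, written inline
page <- read_html(
  "<html><body><table>
     <tr><th>Year</th><th>GDP (billions of CHF)</th><th>US dollar exchange</th></tr>
     <tr><td>1980</td><td>184</td><td>1.67 Francs</td></tr>
     <tr><td>1985</td><td>244</td><td>2.43 Francs</td></tr>
   </table></body></html>"
)

# Locate the table node and convert it to a data frame
tab_node <- html_node(page, xpath = "//table")
tab <- html_table(tab_node)

# Strip the unit and convert the exchange rate to numeric
tab$usd_chf <- as.numeric(sub(" Francs", "", tab[["US dollar exchange"]], fixed = TRUE))
tab
```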
First few steps in a text analysis/natural language processing (NLP) pipeline:
The quanteda package is a comprehensive and widely used
package for text analysis in R. To run the examples with quanteda,
several packages need to be installed. You can use the following command
to make sure that missing packages are installed.
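The original command is not reproduced in these notes. One possible way to check for missing packages in base R is sketched below; the package list is an assumption based on the functions used later in this section, and the install call is left commented out so that nothing is installed unintentionally.

```r
# Packages assumed to be needed for the text analysis examples
pkgs <- c("quanteda", "quanteda.textstats", "quanteda.textplots", "readtext")

# Which of them are not yet installed?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```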
The raw material of quantitative text analysis is a corpus. In NLP, a corpus is a collection of authentic texts organized into a dataset.
In the specific case of quanteda, a corpus is a
data frame consisting of a character vector for documents and
additional vectors for document-level variables. In other
words, a corpus is a data frame that contains, in each row, a text
document, plus additional columns with the corresponding metadata about
the text.
In the examples below, we will use the inauguration
corpus from quanteda, which is a standard corpus used in
introductory text analysis. It contains the first five US presidential
inaugural addresses (by Washington, Adams, and Jefferson). This text data can be loaded from the
readtext package. The metadata of this corpus is the year
of the inauguration and the name of the president taking office.
# set path
path_data <- system.file("extdata/", package = "readtext")
# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
names(dat_inaug)
## [1] "texts" "Year" "President" "FirstName"
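The step that turns the imported data frame into the corpus printed next is not shown. With quanteda this is typically done with `corpus()`, pointing `text_field` at the column that holds the documents. The sketch below uses a two-row stand-in data frame with shortened texts (an assumption) instead of `dat_inaug`.

```r
library(quanteda)

# Stand-in for the imported data: one text column plus metadata columns
dat <- data.frame(
  texts = c("Fellow-Citizens of the Senate and of the House of Representatives",
            "Fellow citizens, I am again called upon"),
  Year = c(1789, 1793),
  President = c("Washington", "Washington"),
  FirstName = c("George", "George"),
  stringsAsFactors = FALSE
)

# Create the corpus; all remaining columns become document-level variables
corp <- corpus(dat, text_field = "texts")
print(corp)

# Look at the metadata (docvars)
docvars(corp)
```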
## Corpus consisting of 5 documents and 3 docvars.
## text1 :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## text2 :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## text3 :
## "When it was first perceived, in early times, that no middle ..."
##
## text4 :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## text5 :
## "Proceeding, fellow citizens, to that qualification which the..."
## Year President FirstName
## 1 1789 Washington George
## 2 1793 Washington George
## 3 1797 Adams John
## 4 1801 Jefferson Thomas
## 5 1805 Jefferson Thomas
# In quanteda, the metadata in a corpus can be handled like data frames.
docvars(corp, field = "Century") <- floor(docvars(corp, field = "Year") / 100) + 1
Once we have a corpus, we want to extract the substance of the text.
In quanteda terms, this means extracting
tokens, i.e., isolating the elements that constitute a
corpus in order to quantify them. Tokens are the expressions that
form the building blocks of the text. They can be single words or
phrases (several consecutive words, so-called n-grams).
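The outputs that follow show, in order, raw tokens, tokens without punctuation, the English stopword list, and tokens with stopwords removed. The calls themselves are missing from these notes; a sketch of how such results are commonly obtained with quanteda is given below. It runs on the opening words of the first address only, and the object names are assumptions.

```r
library(quanteda)

txt <- c(text1 = paste(
  "Fellow-Citizens of the Senate and of the House of Representatives:",
  "Among the vicissitudes incident to life no event could have"
))

# Raw tokens: punctuation is kept as separate tokens
toks_raw <- tokens(txt)

# Drop punctuation while tokenizing
toks_nopunct <- tokens(txt, remove_punct = TRUE)

# English stopword list shipped with quanteda
stops <- stopwords("en")

# Remove stopwords from the tokens
toks_nostop <- tokens_remove(toks_nopunct, pattern = stops)
toks_nostop[[1]]
```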
## [1] "Fellow-Citizens" "of" "the" "Senate" "and"
## [6] "of" "the" "House" "of" "Representatives"
## [11] ":" "Among" "the" "vicissitudes" "incident"
## [16] "to" "life" "no" "event" "could"
## [1] "Fellow-Citizens" "of" "the" "Senate" "and"
## [6] "of" "the" "House" "of" "Representatives"
## [11] "Among" "the" "vicissitudes" "incident" "to"
## [16] "life" "no" "event" "could" "have"
## [1] "i" "me" "my" "myself" "we" "our" "ours"
## [8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
## [15] "him" "his" "himself" "she" "her" "hers" "herself"
## [22] "it" "its" "itself" "they" "them" "their" "theirs"
## [29] "themselves" "what" "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are" "was" "were"
## [43] "be" "been" "being" "have" "has" "had" "having"
## [50] "do" "does" "did" "doing" "would" "should" "could"
## [57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
## [64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
## [78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
## [92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
## [99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an" "the" "and"
## [113] "but" "if" "or" "because" "as" "until" "while"
## [120] "of" "at" "by" "for" "with" "about" "against"
## [127] "between" "into" "through" "during" "before" "after" "above"
## [134] "below" "to" "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again" "further" "then"
## [148] "once" "here" "there" "when" "where" "why" "how"
## [155] "all" "any" "both" "each" "few" "more" "most"
## [162] "other" "some" "such" "no" "nor" "not" "only"
## [169] "own" "same" "so" "than" "too" "very" "will"
## [1] "Fellow-Citizens" "Senate" "House" "Representatives" "Among"
## [6] "vicissitudes" "incident" "life" "event" "filled"
## [11] "greater" "anxieties" "notification" "transmitted" "order"
## [16] "received" "14th" "day" "present" "month"
# We can keep words we are interested in
tokens_select(toks, pattern = c("peace", "war", "great*", "unit*"))
## Tokens consisting of 5 documents and 4 docvars.
## text1 :
## [1] "greater" "United" "Great" "United" "united" "great" "great" "united"
##
## text2 :
## [1] "united"
##
## text3 :
## [1] "war" "great" "United" "great" "great" "peace" "great" "peace" "peace" "United"
## [11] "peace" "peace"
## [ ... and 2 more ]
##
## text4 :
## [1] "greatness" "unite" "unite" "greater" "peace" "peace" "peace" "war"
## [9] "peace" "greatest" "greatest" "great"
## [ ... and 1 more ]
##
## text5 :
## [1] "United" "peace" "great" "war" "war" "War" "peace" "peace" "peace"
# Remove frequent but uninformative words such as "fellow" and "citizen"
toks <- tokens_remove(toks, pattern = c(
"fellow*",
"citizen*",
"senate",
"house",
"representative*",
"constitution"
))
# Build n-grams (bigrams and trigrams)
toks_ngrams <- tokens_ngrams(toks, n = 2:3)
# Select n-grams based on a pattern: keep n-grams that start with "never"
toks_neg_bigram_select <- tokens_select(toks_ngrams, pattern = phrase("never_*"))
head(toks_neg_bigram_select[[1]], 30)
## [1] "never_hear" "never_expected" "never_hear_veneration" "never_expected_nation"
To create a document-term matrix (dtm), which quanteda calls a
document-feature matrix (dfm), we can use quanteda's dfm()
command, as shown below.
## Document-feature matrix of: 5 documents, 1,818 features (72.28% sparse) and 4 docvars.
## features
## docs among vicissitudes incident life event filled greater anxieties notification transmitted
## text1 1 1 1 1 2 1 1 1 1 1
## text2 0 0 0 0 0 0 0 0 0 0
## text3 4 0 0 2 0 0 0 0 0 0
## text4 1 0 0 1 0 0 1 0 0 0
## text5 7 0 0 2 0 0 0 0 0 0
## [ reached max_nfeat ... 1,808 more features ]
dfmat <- dfm(toks)
dfmat <- dfm_trim(dfmat, min_termfreq = 2) # remove tokens that appear fewer than 2 times
## government may public can people shall country every us
## 40 38 30 27 27 23 22 20 20
## nations
## 18
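The frequency listing above is the kind of result `topfeatures()` returns for a document-feature matrix. A small self-contained sketch of the dfm, trimming, and top-feature steps; the two toy documents are assumptions chosen so that the counts are easy to verify by hand.

```r
library(quanteda)

toks_toy <- tokens(
  c(text1 = "peace and war", text2 = "peace, peace and prosperity"),
  remove_punct = TRUE
)

# Rows are documents, columns are features (tokens), cells are counts
dfmat_toy <- dfm(toks_toy)

# The most frequent features across all documents
top <- topfeatures(dfmat_toy, n = 2)
top

# Keep only features that occur at least twice overall
dfmat_trimmed <- dfm_trim(dfmat_toy, min_termfreq = 2)
featnames(dfmat_trimmed)
```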
# compute word frequencies as top feature
tstat_freq <- textstat_frequency(dfmat, n = 5)
# visualize frequencies in word cloud
textplot_wordcloud(dfmat, max_words = 100)
## Loading required package: sp
## Linking to ImageMagick 6.9.12.98
## Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fontconfig, x11
We can generate images directly in R by populating arrays and saving the plots to disk.
# Step 1: Define the width and height of the image
width = 300;
height = 300
# Step 2: Define the number of layers (RGB = 3)
layers = 3
# Step 3: Generate three matrices corresponding to Red, Green, and Blue values
red = matrix(255, nrow = height, ncol = width)
green = matrix(0, nrow = height, ncol = width)
blue = matrix(0, nrow = height, ncol = width)
# Step 4: Generate an array by combining the three matrices
# (rows = height, columns = width, matching the matrices above)
image.array = array(c(red, green, blue), dim = c(height, width, layers))
dim(image.array)
## [1] 300 300 3
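Steps 5 and 6 are missing from the listing. Judging by the RasterBrick summary printed next and the later call to `plotRGB(image)`, they probably convert the array into a raster object and plot it. The sketch below shows this with the raster package on a tiny 2 x 2 all-red array; treat it as a reconstruction under assumptions rather than the original code.

```r
library(raster)

# A small all-red image: 2 x 2 pixels, three layers (red, green, blue)
image.array <- array(c(rep(255, 4), rep(0, 4), rep(0, 4)), dim = c(2, 2, 3))

# Step 5 (assumed): create a RasterBrick from the array
image <- brick(image.array)
image

# Step 6 (assumed): plot the three layers as an RGB image
plotRGB(image)
```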
## class : RasterBrick
## dimensions : 300, 300, 90000, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.003333333, 0.003333333 (x, y)
## extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : layer.1, layer.2, layer.3
## min values : 255, 0, 0
## max values : 255, 0, 0
## Warning in .couldBeLonLat(x, warnings = warnings): CRS is NA. Assuming it is longitude/latitude
# Step 7: (Optional) Save to disk
png(filename = "red.png", width = width, height = height, units = "px")
plotRGB(image)
## Warning in .couldBeLonLat(x, warnings = warnings): CRS is NA. Assuming it is longitude/latitude
# close the graphics device so that the file is written
dev.off()
## png
## 2
# Common Packages for Vector Files
library(xml2)
# Download and read svg image from url
URL <- "https://upload.wikimedia.org/wikipedia/commons/1/1b/R_logo.svg"
Rlogo_xml <- read_xml(URL)
# Data structure
Rlogo_xml
## {xml_document}
## <svg preserveAspectRatio="xMidYMid" width="724" height="561" viewBox="0 0 724 561" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
## [1] <defs>\n <linearGradient id="gradientFill-1" x1="0" x2="1" y1="0" y2="1" gradientUnits="obje ...
## [2] <path d="M361.453,485.937 C162.329,485.937 0.906,377.828 0.906,244.469 C0.906,111.109 162.329 ...
## [3] <path d="M550.000,377.000 C550.000,377.000 571.822,383.585 584.500,390.000 C588.899,392.226 5 ...
## <svg [preserveAspectRatio, width, height, viewBox, xmlns, xmlns:xlink]>
## <defs>
## <linearGradient [id, x1, x2, y1, y2, gradientUnits, spreadMethod]>
## <stop [offset, stop-color, stop-opacity]>
## <stop [offset, stop-color, stop-opacity]>
## <linearGradient [id, x1, x2, y1, y2, gradientUnits, spreadMethod]>
## <stop [offset, stop-color, stop-opacity]>
## <stop [offset, stop-color, stop-opacity]>
## <path [d, fill, fill-rule]>
## <path [d, fill, fill-rule]>
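The indented tag overview above is the kind of summary xml2's `xml_structure()` prints. Since an SVG file is an XML document, its shapes and their attributes can be read like any other XML data. A minimal sketch on a hand-written SVG with a single circle; the SVG string is an assumption used to avoid a download.

```r
library(xml2)

# A tiny vector image: one red circle
svg_string <- '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
</svg>'
svg_doc <- read_xml(svg_string)

# Print the tag hierarchy with attribute names
xml_structure(svg_doc)

# Image size is stored as attributes of the root node
svg_width <- xml_attr(svg_doc, "width")

# Shapes are child nodes; their geometry is stored in attributes
shapes <- xml_children(svg_doc)
shape_names <- xml_name(shapes)
radius <- xml_attr(shapes, "r")
```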
# Raw data
Rlogo_text <- as.character(Rlogo_xml)
# Plot
svg_img = image_read_svg(Rlogo_text)
image_info(svg_img)
## # A tibble: 1 × 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 724 561 sRGB TRUE 0 72x72
# For Optical Character Recognition
library(tesseract)
# fetch and show image
img <- image_read("https://s3.amazonaws.com/libapps/accounts/30502/images/new_york_times.png")
print(img)
## # A tibble: 1 × 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 806 550 sRGB FALSE 714189 38x38
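The recognized text shown next comes from a call that is not included in these notes; with the tesseract package, the function for this is `ocr()`. The sketch below avoids the download by first drawing a short headline into a temporary PNG with base R and then running OCR on that file. The file, the text drawn, and the object names are assumptions, and recognition quality depends on the installed tesseract engine.

```r
library(tesseract)

# Draw a short headline into a temporary PNG file
tmp <- tempfile(fileext = ".png")
png(tmp, width = 800, height = 200)
plot.new()
text(0.5, 0.5, "TITANIC SINKS", cex = 4)
dev.off()

# Run optical character recognition on the image file
ocr_text <- ocr(tmp)
cat(ocr_text)
```

As the newspaper example shows ("The New Work Times"), OCR output usually contains some misread characters and needs cleaning before analysis.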
## The New Work Times. [==S=
##
## TITANIC SINKS FOUR HOURS AFTER HITTING ICEBERG;
## 866 RESCUED BY CARPATHIA, PROBABLY 1250 PERISH;
## ISMAY SAFE, MRS. ASTOR MAYBE, NOTED NAMES MISSING
[1] XPath goes beyond the basic introduction to XML covered in this course and is thus not an exam-relevant topic. If you are interested in learning more about XPath, W3Schools provides a great introductory tutorial to the topic: https://www.w3schools.com/xml/xpath_intro.asp.